What is claimed is: 



1. An audio-visual system for processing video data 
comprising: 

an object detection module capable of providing a 
plurality of object features from the video data; 

an audio processor module capable of providing a 
plurality of audio features from the video data; 

a processor coupled to the object detection module and 
the audio processor module, 

wherein the processor is arranged to determine a 
correlation between the plurality of object features and 
the plurality of audio features. 

2. The system of Claim 1, wherein the processor is 
further arranged to determine whether an animated object in 
the video data is associated with audio. 

3. The system of Claim 2, wherein the plurality of audio 
features comprise two or more of the following: average 
energy, pitch, zero crossing, bandwidth, band central, roll 
off, low ratio, spectral flux, and 12 MFCC components. 

4. The system of Claim 2, wherein the animated object is a 
face and the processor is arranged to determine whether the 
face is speaking. 

5. The system of Claim 4, wherein the plurality of object 
features are eigenfaces that represent global features of 
the face. 

6. The system of Claim 1, further comprising a latent 
semantic indexing module coupled to the processor, wherein 
the latent semantic indexing module preprocesses the 
plurality of object features and the plurality of audio 
features before the correlation is performed. 

7. The system of Claim 6, wherein the latent semantic 
indexing module includes a singular value decomposition 
module. 

8. A method for identifying a speaking person within 
video data, the method comprising the steps of: 

receiving video data including image and audio 
information; 

determining a plurality of face image features from 
one or more faces in the video data; 

determining a plurality of audio features related to 
audio information; 

calculating a correlation between the plurality of 
face image features and the audio features; and 

determining the speaking person based upon the 
correlation. 

9. The method according to Claim 8, further comprising 
the step of normalizing the face image features and the 
audio features. 

10. The method according to Claim 9, further comprising 
the step of performing a singular value decomposition on 
the normalized face image features and the audio features. 

11. The method according to Claim 8, wherein the step of 
determining the speaking person includes determining which 
of the one or more faces has the largest correlation. 

12. The method according to Claim 10, wherein the 
calculating step includes forming a matrix of the face 
image features and the audio features. 

13. The method according to Claim 12, further comprising 
the step of performing an optimal approximate fit using 
smaller matrices as compared to full rank matrices formed 
by the face image features and the audio features. 

14. The method according to Claim 13, wherein the rank of 
the smaller matrices is chosen to remove noise and 
unrelated information from the full rank matrices. 

15. A memory medium including code for processing a video 
including images and audio, the code comprising: 

code to obtain a plurality of object features from the 
video; 

code to obtain a plurality of audio features from the 
video; 

code to determine a correlation between the plurality 
of object features and the plurality of audio features; and 

code to determine an association between one or more 
objects in the video and the audio. 

16. The memory medium of Claim 15, wherein the one or more 
objects comprises one or more faces. 

17. The memory medium of Claim 16, further comprising code 
to determine a speaking face. 

18. The memory medium of Claim 15, further comprising code 
to create a matrix using the plurality of object features 
and the audio features and code to perform a singular value 
decomposition on the matrix. 

19. The memory medium of Claim 18, further comprising code 
to perform an optimal approximate fit using smaller 
matrices as compared to full rank matrices formed by the 
object features and the audio features. 

20. The memory medium according to Claim 19, wherein the 
rank of the smaller matrices is chosen to remove noise and 
unrelated information from the full rank matrices. 
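
By way of illustration only, and without limiting the claims, the correlation and selection steps recited in Claims 8, 9 and 11 might be sketched as follows. The sketch assumes that each face and the audio are each represented by per-frame feature tracks of equal length, that normalization means zero-mean, unit-variance scaling, and that correlation means the Pearson coefficient averaged over all face/audio feature pairs. None of these choices is stated in the claims, and all function names, face labels and sample values are hypothetical.

```python
from math import sqrt


def normalize(track):
    """Scale one feature track to zero mean and unit variance (assumed form of normalizing)."""
    n = len(track)
    mean = sum(track) / n
    std = sqrt(sum((x - mean) ** 2 for x in track) / n)
    if std == 0:
        # A constant track carries no variation to correlate with.
        return [0.0] * n
    return [(x - mean) / std for x in track]


def correlation(a, b):
    """Pearson correlation of two equal-length tracks (0.0 if either is constant)."""
    za, zb = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(za, zb)) / len(za)


def face_audio_score(face_tracks, audio_tracks):
    """Mean absolute correlation over every (face feature, audio feature) pair."""
    pairs = [abs(correlation(f, a)) for f in face_tracks for a in audio_tracks]
    return sum(pairs) / len(pairs)


def pick_speaker(faces, audio_tracks):
    """Return the face label with the largest score, plus all scores."""
    scores = {name: face_audio_score(tracks, audio_tracks)
              for name, tracks in faces.items()}
    return max(scores, key=scores.get), scores


# Hypothetical data: one audio feature (e.g. average energy) over six frames,
# and one image feature per face over the same frames.
audio_tracks = [[0.1, 0.9, 0.2, 0.8, 0.1, 0.7]]
faces = {
    "face_a": [[0.0, 1.0, 0.1, 0.9, 0.0, 0.8]],  # varies with the audio
    "face_b": [[0.5, 0.5, 0.6, 0.5, 0.4, 0.5]],  # nearly static
}

speaker, scores = pick_speaker(faces, audio_tracks)
print(speaker, scores)
```

The sketch omits the singular value decomposition and reduced-rank fit of Claims 10 and 12 to 14. Under the same assumptions, such a step would be applied to the matrix of normalized features before the correlation is calculated.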